Run fixpoint per strongly connected component #999
Conversation
CodSpeed Performance Report: Merging #999 will degrade performance by 14.71%.
Hmm, I'm surprised that this is even slower than #995, considering that most logic now lives directly in `…`. Either way, I think there are some optimization opportunities:
Force-pushed from a2b3da6 to 058fa5b.
```rust
// If the value is from the same revision but is still provisional, consider it changed
// because we're now in a new iteration.
```
I always found the `deep_verify_memo` logs for provisional values confusing. Exiting early here ensures they don't show up.
Force-pushed from ab90dd2 to dbb3eb0.
The amortized benchmark results make no sense to me... I didn't touch `…`.
Fix serde attribute
Force-pushed from 3ef0cbe to 81137dd.
Force-pushed from 06826c6 to d6f0f20.
I found one more bug. We hit `…`. I'm not entirely sure yet what's happening.
@MichaReiser Should I review this now, or is it better to wait for further exploration of this new bug?
It's ready. I very much hope that any bugs I discover now won't require a significant rewrite. The most recent bug, related to panicking cycles, required only a 3-line fix.
Force-pushed from e5a0069 to c5cc3dd.
```rust
let mut opt_last_provisional: Option<&Memo<'db, C>> = None;
// This is different from `opt_old_memo` which might be from a different revision.
let mut last_provisional_memo: Option<&Memo<'db, C>> = None;
// TODO: Can we seed those somehow?
```
This is #980
```rust
// This isn't strictly necessary, but if this is a provisional memo for an inner cycle,
// await all outer cycle heads to give the thread driving it a chance to complete
// (we don't want multiple threads competing for the queries participating in the same cycle).
```
This is required for fallback immediate in a multi-threaded context. I didn't spend any time understanding why.
```rust
// (we can't rely on `iteration_count` being updated for nested cycles because the nested cycles may have completed successfully).
// b) It's guaranteed that this query will panic again anyway.
// That's why we simply propagate the panic here. It simplifies our lives and it also avoids duplicate panic messages.
if old_memo.value.is_none() {
```
If `value.is_none()` always means "panicked" (which is what it looks like is implied here), would it make sense to reflect that more clearly in method naming, e.g. add a `memo.panicked()` or similar?
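For illustration, a minimal sketch of what that suggested helper could look like (`Memo` here is a simplified stand-in, and `panicked` is the hypothetical method, not an existing Salsa API):

```rust
// Simplified stand-in for Salsa's memo type; `panicked` is the
// hypothetical helper suggested above.
struct Memo<V> {
    value: Option<V>,
}

impl<V> Memo<V> {
    /// In this code path, a missing value means the query panicked
    /// during an earlier execution in this revision.
    fn panicked(&self) -> bool {
        self.value.is_none()
    }
}

fn main() {
    let memo: Memo<i32> = Memo { value: None };
    assert!(memo.panicked());
}
```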
```rust
// t1: a (completes `b` with `c` in heads)
//
// Note how `a` only depends on `c` but not `a`. This is because `a` only saw the initial value of `c` and wasn't updated when `c` completed.
// That's why we need to resolve the cycle heads recursively to `cycle_heads` contains all cycle heads at the moment this query completed.
```
```diff
- // That's why we need to resolve the cycle heads recursively to `cycle_heads` contains all cycle heads at the moment this query completed.
+ // That's why we need to resolve the cycle heads recursively so `cycle_heads` contains all cycle heads at the moment this query completed.
```
I didn't review everything too closely, but I left a couple nits and questions.
```rust
// b) It's guaranteed that this query will panic again anyway.
// That's why we simply propagate the panic here. It simplifies our lives and it also avoids duplicate panic messages.
if old_memo.value.is_none() {
    ::tracing::warn!("Propagating panic for cycle head that panicked in an earlier execution in that revision");
```
Is there a reason we're using the `tracing` crate directly instead of our wrapper (and in other places as well)? Is it because this is already a cold path?
I tried to add `warn` to our own tracing but Rust was mad at me for `warn` being a reserved keyword or something. I then decided not to bother because we only use `warn` in very cold paths.
```rust
pub(crate) struct SyncState {
    id: ThreadId,
    /// The thread id that is owning this query (actively executing it or iterating it as part of a larger cycle).
```
```diff
- /// The thread id that is owning this query (actively executing it or iterating it as part of a larger cycle).
+ /// The thread id that currently owns this query (actively executing it or iterating it as part of a larger cycle).
```
```rust
}

#[derive(Copy, Clone, Debug)]
pub enum SyncOwnerId {
```
Nit: Maybe just call this `SyncOwner`, since it doesn't necessarily hold an ID.
```rust
/// Signalled whenever a query with dependents completes.
/// Allows those dependents to check if they are ready to unblock.
// condvar: unsafe<'stack_frame> Pin<&'stack_frame Condvar>,
```
Looks like this was mistakenly removed.
It wasn't clear to me what it documents. Do you think this should be part of the field comment?
Yeah it would be fine as a field comment.
```rust
pub(super) fn mark_as_transfer_target(&self, key_index: Id) -> Option<SyncOwnerId> {
    let mut syncs = self.syncs.lock();
    syncs.get_mut(&key_index).map(|state| {
        state.anyone_waiting = true;
```
Why do we unconditionally set `anyone_waiting` here?
I can add a comment. It's an optimization so that `ClaimGuard::release` can always early exit if `anyone_waiting` is false, without having to check `is_transfer_target` too.
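A rough sketch of that fast path (hypothetical types and shape, not Salsa's actual `ClaimGuard`): because `mark_as_transfer_target` also sets `anyone_waiting`, the early exit below can never skip a pending transfer.

```rust
// Hypothetical shapes, for illustration only.
struct SyncState {
    anyone_waiting: bool,
    is_transfer_target: bool,
}

fn release(state: &SyncState) {
    // Fast path: nobody is waiting AND (because `mark_as_transfer_target`
    // also sets `anyone_waiting`) no transfer can be pending either.
    if !state.anyone_waiting {
        return;
    }
    if state.is_transfer_target {
        // Hand the lock over to the owning query instead of releasing it.
        return;
    }
    // Otherwise, wake up the blocked threads.
}

fn main() {
    release(&SyncState { anyone_waiting: false, is_transfer_target: false });
}
```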
```rust
/// Whether this query has been claimed by the query that currently owns it.
///
/// If `a` has been transferred to `b` and the stack for t1 is `b -> a`, then `a` can be claimed
/// and `claimed_transferred` is set to `true`. However, t2 won't be able to claim `a` because
```
```diff
- /// and `claimed_transferred` is set to `true`. However, t2 won't be able to claim `a` because
+ /// and `claimed_twice` is set to `true`. However, t2 won't be able to claim `a` because
```
```rust
.iter()
.find(|active_query| {
    cycle_heads.contains(&active_query.database_key_index)
        && active_query.database_key_index != current_key
```
Is this enough to identify strongly connected components? Do we never need to prefer the outermost query for deeply nested cycles?
Finding the outermost query is somewhat difficult when the cycle stretches across multiple threads. At some point, I had an implementation that queried the `DependencyGraph` to find the outermost cycle. However, it turned out that finding the outermost cycle was fairly expensive. Just finding "an outer" query is cheaper, and we can leave it to "that outer" query to find the "next outer" query. Ultimately, we'll reach the outermost query.

Things are much easier if all queries are on the stack, which is what we handle here. We start from the outermost active query and try to find the first query that is a cycle head. This gives us a "local maximum" for what the outermost query is.
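A toy version of that "local maximum" search (hypothetical signature and names; the real code is the snippet quoted above):

```rust
/// Walk the active-query stack from the outermost query inward and return
/// the first query that is a cycle head other than the current one.
fn find_outer_head(
    stack: &[&'static str],
    cycle_heads: &[&'static str],
    current: &'static str,
) -> Option<&'static str> {
    stack
        .iter()
        .copied()
        .find(|q| cycle_heads.contains(q) && *q != current)
}

fn main() {
    // Stack, outermost first: a -> b -> c, where `a` and `c` are cycle heads.
    // Starting from `c`, the "local maximum" outer head is `a`.
    assert_eq!(find_outer_head(&["a", "b", "c"], &["a", "c"], "c"), Some("a"));
}
```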
```rust
let memo_iteration_count = old_memo.revisions.iteration();

// The `DependencyGraph` locking propagates panics when another thread is blocked on a panicking query.
// However, the locking doesn't handle the case where a thread fetches the result of a panicking cycle head query **after** all locks were released.
```
This PR rewrites how Salsa runs fixpoint iterations when multiple cycle heads participate in a cycle (when there are multiple nested cycles).
Motivation
The main motivation for this is that we see a runaway in ty for code like this:
Or, as a graph:
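(A sketch reconstructed from the SCC list below, showing only the edges it names:)

```
x1    x2        (no dependencies)

y1 <-> y2
y1 <-> y3       (one SCC: y1, y2, y3)

z1 <-> z2       (another SCC: z1, z2)
z1  -> y1       (z1 depends on y1 in one direction only)
```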
Today, we'll run separate fixpoint iterations for `y1 <-> y2`, `y3 <-> y1`, and `z1 <-> z2`.

The downside of this is that Salsa runs the nested fixpoint iteration `y1 <-> y3` to convergence for every `y1 <-> y2` iteration. If there are more deeply nested cycles, then each of those inner cycles is run to completion for every `y1 <-> y2` and `y1 <-> y3` iteration, making the number of iterations dependent on the nesting of queries rather than on how quickly the problem converges.

To avoid this, the idea is to identify the strongly connected components in the dependency graph. These are:

- `x1` (no dependencies)
- `x2` (no dependencies)
- `y1`, `y2`, `y3`
- `z1`, `z2` (only depends on `y1` in one direction but not both)

Then, run a fixpoint per strongly connected component. That means we only run 2 fixpoint iterations for the example above. This ensures that the runtime is bound by how quickly the problem converges and not by how deeply nested the queries are (which creates a tension between granular queries and performance).
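To make "identify the strongly connected components" concrete, here is a small, self-contained sketch; this is not Salsa's implementation, it just runs Tarjan's algorithm on the example graph above:

```rust
use std::collections::HashMap;

/// Tarjan's algorithm over the example graph; each `Vec` in `sccs`
/// is one strongly connected component and would get its own fixpoint.
struct Tarjan<'g> {
    graph: &'g HashMap<&'static str, Vec<&'static str>>,
    next_index: usize,
    indices: HashMap<&'static str, usize>,
    lowlink: HashMap<&'static str, usize>,
    stack: Vec<&'static str>,
    sccs: Vec<Vec<&'static str>>,
}

impl<'g> Tarjan<'g> {
    fn visit(&mut self, v: &'static str) {
        self.indices.insert(v, self.next_index);
        self.lowlink.insert(v, self.next_index);
        self.next_index += 1;
        self.stack.push(v);

        for &w in self.graph.get(v).into_iter().flatten() {
            if !self.indices.contains_key(w) {
                self.visit(w);
                let low = self.lowlink[v].min(self.lowlink[w]);
                self.lowlink.insert(v, low);
            } else if self.stack.contains(&w) {
                let low = self.lowlink[v].min(self.indices[w]);
                self.lowlink.insert(v, low);
            }
        }

        if self.lowlink[v] == self.indices[v] {
            // `v` roots an SCC: the component is the stack suffix above `v`.
            let pos = self.stack.iter().position(|&n| n == v).unwrap();
            self.sccs.push(self.stack.split_off(pos));
        }
    }
}

fn main() {
    let graph = HashMap::from([
        ("x1", vec![]),
        ("x2", vec![]),
        ("y1", vec!["y2", "y3"]),
        ("y2", vec!["y1"]),
        ("y3", vec!["y1"]),
        ("z1", vec!["z2", "y1"]),
        ("z2", vec!["z1"]),
    ]);

    let mut tarjan = Tarjan {
        graph: &graph,
        next_index: 0,
        indices: HashMap::new(),
        lowlink: HashMap::new(),
        stack: Vec::new(),
        sccs: Vec::new(),
    };
    for v in ["x1", "x2", "y1", "y2", "y3", "z1", "z2"] {
        if !tarjan.indices.contains_key(v) {
            tarjan.visit(v);
        }
    }

    // Prints [["x1"], ["x2"], ["y1", "y2", "y3"], ["z1", "z2"]]:
    // only the `y` and `z` components need a fixpoint iteration.
    println!("{:?}", tarjan.sccs);
}
```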
Single-threaded implementation
Let's ignore multi-threading first as this only complicates things ;)
The change in itself isn't that complicated:

- `execute_maybe_iterate`: If the query is part of an outer cycle, then don't remove the cycle head from `cycle_heads`; set a flag on the memo indicating whether this head has converged, and return the provisional value. The query will iterate as part of the outer cycle.
- Only finish the fixpoint iteration once all cycle heads (`memo.cycle_heads`) have converged.

That's roughly it :)
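To illustrate the "one fixpoint per component" idea with a toy model (this is not Salsa's code): all heads of one strongly connected component are recomputed together from their previous provisional values until nothing changes.

```rust
/// Run one fixpoint for a whole strongly connected component: `step`
/// recomputes all heads from the previous provisional values.
fn fixpoint_scc(values: &mut Vec<i32>, step: impl Fn(&[i32]) -> Vec<i32>) -> usize {
    let mut iterations = 0;
    loop {
        iterations += 1;
        let next = step(values);
        if next == *values {
            // Every head in the component converged in the same iteration.
            return iterations;
        }
        *values = next;
    }
}

fn main() {
    // Two mutually dependent "queries", seeded with initial provisional values.
    // Toy convergence: each head saturates at 3.
    let mut values = vec![0, 0];
    let iterations = fixpoint_scc(&mut values, |v| vec![(v[1] + 1).min(3), (v[0] + 1).min(3)]);
    assert_eq!(values, vec![3, 3]);
    assert_eq!(iterations, 4);
}
```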
Multithreading
Where things get complicated is multithreading. I tried to keep the existing locking (where we wait on cycle heads in some places), but I wasn't able to make it work (there were spurious errors till the end). That's why I ended up rewriting cycle head synchronization from scratch.
What makes cycles tricky is that multiple threads can enter the same cycle simultaneously, each starting from a different cycle head that ultimately resolves to the same outer cycle. The new implementation makes this even trickier because inner cycles now complete immediately without iterating, which results in them releasing their locks. That allows other threads to claim those locks again, which can result in them competing for the same locks forever.
This PR extends Salsa's `DependencyGraph` with a mechanism to transfer the ownership of a query to another query. Let's say we have `a -> b -> a` and `b -> c -> b`. The inner cycle head `b` can transfer its lock to the query `a`. `b` then remains locked until `c` completes (successfully or panicking). Only the thread owning `a` (because it's the thread holding that lock, or because it's another thread that `a` is blocked on) is allowed to reclaim `b`.

This ensures that the locks of all cycle heads participating in a cycle ultimately converge to be owned by a single query/thread. We need to do this for every query that supports cycle handling, even if it isn't a cycle head, because the query could become a cycle head in later iterations (e.g., due to changes in the cycle-entry point).
I also found this approach much easier to reason about than the old approach of "just wait for all cycle heads".
However, there's a downside. It requires acquiring the global `DependencyGraph` lock and tracking dependencies whenever claiming or releasing any query with cycle-handling support that participates in a cycle. We incur this cost even when all queries participating in the cycle run on the same thread; this is also the main reason why `converge-diverge`'s performance regresses. I'm not sure if there's a solution to avoid this.

TODO:

- `maybe_changed_after`
- `cycle_nested_deep_conditional_changed::the_test` sometimes hangs or panics